This document outlines the R code that was used across the lectures for Advanced Statistics course at MSc in Psychological Research Methods programme at the University of Sheffield, UK. The code follows the lectures, but also expands on the content covered by the lecture.

Lecture 1: Linear regression

Introduction

First, we used the dataset that are included in R packages. We can see what data do we have in standard R packages by calling: data() or we can see datasets specific for certain R packages by calling: data(package=‘lme4’).

In this case, we can call pre-loaded dataset cars. More information on this data is available by calling ?cars or help(cars).

data("cars")  #calls the dataset
head(cars)  # prints first 6 observations in the dataset
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

We can use scatter plot to see the raw values.

plot(cars$speed, cars$dist, xlab = "Predictor", ylab = "Outcome")  #As this plot was only used to make an illustration of the linear regression, we changed the labels of x and y-axis. This was done using xlab and ylab parameters. You can change the name of the labels using the same code, eg. xlab='Speed (mph)', ylab='Stopping distance (ft)'
abline(lm(dist ~ speed, data = cars), lwd = 2)  #abline function adds one or more straight lines through the plot that you have active in your environment. lm function is used to fit linear models, where dist is modelled as a function of speed. We also indicate that values for distance and speed can be found in the cars dataset (data = cars). Finally, we specify the thickness of abline function with lwd=2 parameter. 

Data simulation

We can also simulate some data:

set.seed(456)  # we can set starting numbers that are used to generate our random values. Random numbers are rarely truly random: https://engineering.mit.edu/engage/ask-an-engineer/can-a-computer-generate-a-truly-random-number/
Babies = data.frame(Age = round(runif(100, 1, 30)), Weight = rnorm(100, 4000, 500))  #We create a new data frame (Babies) where we have Age and Weight as variables. 100 Age values are sampled from random uniform distribution (runif) with lower bound of 1 (minimum) and upper bound of 30 (maximum). 100 Weight values are generated using random normal distribution (rnorm) with mean of 4000 and SD of 500 
Babies$Height = rnorm(100, 40 + 0.2 * Babies$Age + 0.004 * Babies$Weight, 5)  # Height is generated using random normal distribution where mean is a function of Age and Weight, while SD is 5. 
Babies$Gender = factor(rbinom(100, 1, 0.5))  # 100 Gender values are generated using random binomial function with equal probability of being assigned one or the other sex category
levels(Babies$Gender) = c("Girls", "Boys")  #We levels function to assign Girls and Boys labels to 1 and 0 levels generated by the function

We can plot and inspect raw data:

par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)  # par parameter sets global plotting settings. mfrow indicates number of panels for plots (1 row and 3 columns), bty sets type of box around the plot, mar defines margines around the plot, cex magnifies size of labels, while pch sets type of points used in the plot.
plot(Babies$Age, Babies$Height, xlab = "Age (months)", ylab = "Height (cm)")
plot(Babies$Weight, Babies$Height, xlab = "Weight (grams)", ylab = "Height (cm)")
boxplot(Babies$Height ~ Babies$Gender, xlab = "Gender", ylab = "Height (cm)")

We can also check the coding and formatting of the variables in our data frame.

str(Babies)
## 'data.frame':    100 obs. of  4 variables:
##  $ Age   : num  4 7 22 26 24 11 3 9 8 12 ...
##  $ Weight: num  3668 3872 4339 4448 4309 ...
##  $ Height: num  61.5 56.1 59.3 55.2 60.3 ...
##  $ Gender: Factor w/ 2 levels "Girls","Boys": 1 1 2 2 2 2 1 1 1 1 ...

Finally, we can also load the dataset from our local disc drives. The datasets should be placed in your working environment. There are a number of tutorials that guide you how to optimise creation of the new projects, which results in all of your data, code and files being generated and saved at one place. https://r4ds.had.co.nz/workflow-projects.html

inequality <- read.table("inequality.txt", sep = "\t", header = T)  # read a file in table format and create a data frame from it. sep parameter defines separator in the data - tab delimited format in this case, while it could be also anything else, such as ',' - comma separated values, ' ' - space separated values etc. Header = T defines that the names of the variables are in the first row of the database.

# You can also numerous other extensions, suchs spss sav files, however you
# will often need additional packages for this.  Foreign package -
# library(foreign) has read.spss function that can be used to read in spss
# files dataExample<-read.spss('Example.sav', to.data.frame=T)

Linear regression

One predictor

lm1 <- lm(Height ~ Age, data = Babies)  # linear model where Height is modelled as a function of Age. 
lm1$coefficients  # We can print only the coefficients
## (Intercept)         Age 
##   57.025804    0.143174
summary(lm1$coefficients)  #We can also part of the information that is frequently used to judge significance and importance of predictors  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1432 14.3638 28.5845 28.5845 42.8051 57.0258

Two predictors

lm2 <- lm(Height ~ Age + Gender, data = Babies)  # linear model where Height is analysed as a function of Age and Gender
lm2$coefficients
## (Intercept)         Age  GenderBoys 
##  57.5038799   0.1403955  -0.8309449

Main effects and interaction (numerical x categorical)

lm3 <- lm(Height ~ Age * Gender, data = Babies)  # linear model where Height is modelled as a function of Age and Gender, as well as their interaction
lm3$coefficients
##    (Intercept)            Age     GenderBoys Age:GenderBoys 
##     56.1448603      0.2202400      1.8315868     -0.1607307

Main effects and interaction (numerical x numerical)

lm4 <- lm(Height ~ Age * Weight, data = Babies)
lm4$coefficients
##   (Intercept)           Age        Weight    Age:Weight 
## 30.7244904854  0.6913148965  0.0066360329 -0.0001397745

Other information that we get from the linear model

lm1 <- lm(Height ~ Age, data = Babies)
summary(lm1)
## 
## Call:
## lm(formula = Height ~ Age, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4765  -4.1601  -0.3703   3.9198  12.3842 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.02580    1.18751  48.021   <2e-16 ***
## Age          0.14317    0.06426   2.228   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.283 on 98 degrees of freedom
## Multiple R-squared:  0.04821,    Adjusted R-squared:  0.0385 
## F-statistic: 4.964 on 1 and 98 DF,  p-value: 0.02817

We can calculate R2 step-by-step. First, we need predictions from the model that includes predictor, as this will give us residual sum of squares.

Babies$lm1 = predict(lm1, newdata = Babies)  # predict Height based on our lm1 model.
Babies$diff = Babies$lm1 - Babies$Height  #calculate differences between predicted (lm1) and observed values (Height) 

We can plot these differences using ggplot.

require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = Height, ymax = lm1,
    colour = diff), size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm1),
    size = 1.2) + geom_point(data = Babies, aes(Age, y = Height), size = 2) + ylab("Height") +
    xlab("Age") + ggtitle("SS_residual")

A second ingredient for our determination coefficient is sum of squares of total variation in the data. This we can get by building model with intercept only.

lm0 <- lm(Height ~ 1, data = Babies)
summary(lm0)
## 
## Call:
## lm(formula = Height ~ 1, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4165  -4.2284  -0.2062   3.6744  13.5940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.3953     0.5388   110.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.388 on 99 degrees of freedom
Babies$lm0 = predict(lm0, newdata = Babies)  #predict Height based on average value only
Babies$diff2 = Babies$lm0 - Babies$Height  #calculate difference between predicted (lm0) and observed values (Height)

We can plot these differences using ggplot.

ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = Height, ymax = lm0,
    colour = diff2), size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm0),
    size = 1.2) + geom_point(data = Babies, aes(Age, y = Height), size = 2) + ylab("Height") +
    xlab("Age") + ggtitle("SS_total")

The R2 coefficient is:

1 - (sum(Babies$diff^2)/(sum(Babies$diff2^2)))
## [1] 0.04821098

Improvement in our prediction:

Babies$diff3 = Babies$lm1 - Babies$lm0  #Improvement based on the inclusion of Age as a predictor - differences between the predicted values from (lm1) and intercept only model (lm0)
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = lm1, ymax = lm0, colour = diff3),
    size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm0), size = 1.2) + geom_line(data = Babies,
    aes(Age, y = lm1), size = 1.3, linetype = "dotdash") + geom_point(data = Babies,
    aes(x = Age, y = Height), size = 0.9) + ylab("Height") + xlab("Age") + ggtitle("Improvement") +
    theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))

F-value:

(sum(Babies$diff3^2)/1)/(sum(Babies$diff^2)/98)
## [1] 4.963995

Practical aspect of the lecture

inequality <- read.table("inequality.txt", sep = ",", header = T)  #Reading the data in R
head(inequality)
##        state median_household_income share_unemployed_seasonal
## 1    Alabama                   42278                     0.060
## 2     Alaska                   67629                     0.064
## 3    Arizona                   49254                     0.063
## 4   Arkansas                   44922                     0.052
## 5 California                   60487                     0.059
## 6   Colorado                   60940                     0.040
##   share_population_in_metro_areas share_population_with_high_school_degree
## 1                            0.64                                    0.821
## 2                            0.63                                    0.914
## 3                            0.90                                    0.842
## 4                            0.69                                    0.824
## 5                            0.97                                    0.806
## 6                            0.80                                    0.893
##   share_non_citizen share_white_poverty gini_index share_non_white
## 1              0.02                0.12      0.472            0.35
## 2              0.04                0.06      0.422            0.42
## 3              0.10                0.09      0.455            0.49
## 4              0.04                0.12      0.458            0.26
## 5              0.13                0.09      0.471            0.61
## 6              0.06                0.07      0.457            0.31
##   share_voters_voted_trump hate_crimes_per_100k_splc
## 1                     0.63                0.12583893
## 2                     0.53                0.14374012
## 3                     0.50                0.22531995
## 4                     0.60                0.06906077
## 5                     0.33                0.25580536
## 6                     0.44                0.39052330
##   avg_hatecrimes_per_100k_fbi
## 1                   1.8064105
## 2                   1.6567001
## 3                   3.4139280
## 4                   0.8692089
## 5                   2.3979859
## 6                   2.8046888

Probability density plots: outcomes

par(mfrow = c(1, 2))
plot(density(inequality$hate_crimes_per_100k_splc, na.rm = T), main = "Crimes per 100k")  #probability density for hate crimes outcomes. na.rm=T indicates that only cases that are not NAs should be returned
plot(density(inequality$avg_hatecrimes_per_100k_fbi, na.rm = T), main = "Average Crimes")

Probability density plots: predictors

par(mfrow = c(1, 2))
plot(density(inequality$median_household_income, na.rm = T), main = "Income")
plot(density(inequality$gini_index, na.rm = T), main = "Gini")

Scatter plots:

par(mfrow = c(1, 2))
plot(inequality$median_household_income, inequality$avg_hatecrimes_per_100k_fbi,
    xlab = "Median household income", ylab = "Avg hatecrimes")
plot(inequality$gini_index, inequality$avg_hatecrimes_per_100k_fbi, xlab = "Gini index",
    ylab = "Avg hatecrimes")

cor(inequality[, c(2, 8, 12)], use = "complete.obs")  #calculate correlations where we use only rows that have all values (no NAs). We are only focusing on columns: 2,8 and 12. inequality is a data frame with (n rows, m columns), so we can access only first row by calling inequality[1,], or just first column by inequality[,1], first row and first column would be inequality[1,1]. When using inequality[,c(2,8,12)] am asking for all rows but only second, eight and twelfth column.
##                             median_household_income gini_index
## median_household_income                   1.0000000 -0.1497444
## gini_index                               -0.1497444  1.0000000
## avg_hatecrimes_per_100k_fbi               0.3182464  0.4212719
##                             avg_hatecrimes_per_100k_fbi
## median_household_income                       0.3182464
## gini_index                                    0.4212719
## avg_hatecrimes_per_100k_fbi                   1.0000000

Stepwise approach in building linear model

mod1 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income, data = inequality)  #modelling avg hatecrimes as a function of median_household_income
summary(mod1)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income, 
##     data = inequality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3300 -1.1183 -0.1656  0.7827  7.7762 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             -9.564e-01  1.448e+00  -0.661   0.5121  
## median_household_income  6.054e-05  2.603e-05   2.326   0.0243 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.642 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1013, Adjusted R-squared:  0.08256 
## F-statistic: 5.409 on 1 and 48 DF,  p-value: 0.0243

Adding a new predictor and comparing it with the previous model.

mod2 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index, data = inequality)
anova(mod2)  #anova type comparison of the model that allows us to see whether newly added predictor explained additional variation in the outcome. We can see changes in the Sum Sq for residuals and for the main effects
## Analysis of Variance Table
## 
## Response: avg_hatecrimes_per_100k_fbi
##                         Df Sum Sq Mean Sq F value    Pr(>F)    
## median_household_income  1 14.584  14.584  7.0649 0.0107093 *  
## gini_index               1 32.389  32.389 15.6906 0.0002517 ***
## Residuals               47 97.020   2.064                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also test whether there is need to adjust influence of median household income across the values of gini index - interaction between our predictors

mod3 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index, data = inequality)
anova(mod1, mod2, mod3)  # comparison between three models
## Analysis of Variance Table
## 
## Model 1: avg_hatecrimes_per_100k_fbi ~ median_household_income
## Model 2: avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index
## Model 3: avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     48 129.41                                  
## 2     47  97.02  1    32.389 19.742 5.539e-05 ***
## 3     46  75.47  1    21.550 13.135 0.0007218 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary of the model:

summary(mod3)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income * 
##     gini_index, data = inequality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16383 -0.96279 -0.01053  0.88008  2.75763 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         9.056e+01  3.070e+01   2.950 0.004986 ** 
## median_household_income            -1.724e-03  4.965e-04  -3.472 0.001136 ** 
## gini_index                         -2.001e+02  6.666e+01  -3.002 0.004330 ** 
## median_household_income:gini_index  3.907e-03  1.078e-03   3.624 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.281 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4759, Adjusted R-squared:  0.4417 
## F-statistic: 13.92 on 3 and 46 DF,  p-value: 1.367e-06

We can visualise the interactions using interactions package.

require(interactions)
## Loading required package: interactions
## Warning: package 'interactions' was built under R version 4.1.1
interact_plot(mod3, pred = median_household_income, modx = gini_index, plot.points = T)

Interactive visualisation:

p <- ggplot(simulatedD, aes(median_household_income, Avg_hate_pred, color = gini_index,
    frame = gini_index)) + geom_point()  # Using ggplot to make the plot
plotly::ggplotly(p)  #make it interactive using plotly

Model diagnostics

Quantile-quantile plot: Normality

car::qqPlot(mod3)

## [1]  9 18

Homoscedasticity: Linearity

car::residualPlot(mod3, type = "rstudent")  # we can call car package without loading it into R environment by calling car::

Outliers:

car::influenceIndexPlot(mod3)  #influence of individual observation on our model

Autocorrelation of residuals

par(bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
stats::acf(resid(mod3))

Predicted versus observed data:

par(bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(predict(mod3), mod3$model$avg_hatecrimes_per_100k_fbi, xlab = "Predicted values (model 3)",
    ylab = "Observed values (avg hatecrimes)")  #The x-axis is predict(mod3) - predict values based on our model. The y-axis is the data that was used to build the model. I decided to take these values from our model frame (mod3), instead of original data frame (inequalities). 

Finally, we can subset our model and exclude data point that might skew our results

mod3.cor <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index,
    data = inequality, subset = -9)  #We subset our data to exclude row 9
summary(mod3.cor)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income * 
##     gini_index, data = inequality, subset = -9)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.90964 -0.94966 -0.08526  0.73470  2.55257 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         4.429e+01  3.080e+01   1.438    0.157
## median_household_income            -7.992e-04  5.228e-04  -1.529    0.133
## gini_index                         -9.668e+01  6.725e+01  -1.438    0.157
## median_household_income:gini_index  1.837e-03  1.145e-03   1.604    0.116
## 
## Residual standard error: 1.154 on 45 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1285, Adjusted R-squared:  0.07041 
## F-statistic: 2.212 on 3 and 45 DF,  p-value: 0.09974